\documentclass[a4paper,12pt]{report}
\usepackage{graphics}
\usepackage{graphicx}
\usepackage{amsthm}
\renewcommand{\thesection}{\arabic{section}}

\begin{document}

\begin{center}
{\bf\Large Trip Purpose Identification using Cell Phones}\\[\baselineskip]
{\bf\Large Intern: Shuai Zheng}\\[\baselineskip]
{\Large August, 2011}

\end{center}


\section*{\centering{\uppercase{Abstract}}}

Trip purpose describes what residents use the road network for, and it is important for urban planning and traffic operations. Previous work used GPS data to supplement travel-diary-based trip purpose research. More recently, researchers have tried to eliminate the travel diary altogether and to infer trip purpose from GPS data combined with detailed point of interest (POI) information. However, POI information is often insufficient for purpose identification, especially when it is incomplete or out of date, or when the location is inaccurate. In our work, we adopt data mining methods to identify home and work locations and to infer the purpose of a trip. Associations among trips, along with other trip features, are considered in purpose identification. Furthermore, we apply this approach to identify trip purposes from Call Detail Records (CDR), which have very low frequency and accuracy. The results on real and simulated datasets show the effectiveness of the proposed approach across various location accuracies and frequencies.

\section{Introduction}

Trip purpose research provides information about why people use the transportation network. First, it is important for urban planning. For example, through trip purpose research, city officials can learn how long people travel to go shopping every day. Using this information, the government can place shopping centers, banks, and other facilities where communities need them. Second, it plays an important role in traffic operations. For example, when a hurricane, storm, or earthquake damages the highway network, the emergency department can decide which routes should be cleared first. Large events create similar demands: London will host the 2012 Olympic Games, and a huge number of people will visit the city during the Games, which will put heavy pressure on traffic operations. Trip purpose research can help city officials determine which routes are used for what purpose and then assign different road-use priorities to different groups of travelers.

\section{Related Work and Background}

\begin{figure}[htbp]
\centering
\includegraphics{pre_method}
\caption{Traditional travel-diary-based research.}
\label{fig:pre_method}
\end{figure}

Household travel data collection methods have evolved over time. Early surveys were conducted using paper-and-pencil interview (PAPI) methods in the form of mail-out and mail-back surveys with in-home interviews. During the 1980s and 1990s, most travel surveys replaced the mail-back portion of data retrieval with computer-assisted telephone interviews (CATI). Most recently, computer-assisted self-interview (CASI) methods, in which respondents record their responses directly into a computer (desktop, laptop, or handheld), are being implemented. All of these are forms of travel-diary-based research, as in Figure~\ref{fig:pre_method}. This kind of research requires extensive participation from respondents. Since people are often unwilling to take part in these surveys, there is usually not enough data for research use.

\begin{figure}[htbp]
\centering
\includegraphics{wolf}
\caption{Wolf method.}
\label{fig:wolf}
\end{figure}

Global Positioning System (GPS) technologies, which provide second-by-second position data with accuracies of three to five meters as well as highly accurate velocity and time data, bring a new level of comprehensiveness and accuracy to travel surveys. At first, researchers used GPS data to supplement travel-diary-based research. More recently, researchers have started to use GPS data alone for trip purpose research. In 2001, Wolf proposed determining the purpose of a trip from the land use information, or ``point of interest'' (POI), of the trip destination. As in Figure~\ref{fig:wolf}, if the destination point is in a shopping area, the trip is classified as a shopping trip.

\begin{figure}[htbp]
\centering
\includegraphics{terry}
\caption{Terry method.}
\label{fig:terry}
\end{figure}

In 2008, Terry used a machine learning clustering method to group all destination points into clusters; see Figure~\ref{fig:terry}. He then determined the purpose of trips using the POI information in each cluster. Terry's method handles many trips together, whereas Wolf's method classifies each trip on its own.

The related work shows that previous methods rely heavily on ``point of interest'' (POI) information. However, POI information is often incomplete, out of date, or ambiguous (for example, when several organizations share the same building). We therefore want to design a classification approach that does not require POI information.

\section{Our Approach}

\begin{figure}[htbp]
\centering
\includegraphics[width=0.75\textwidth]{our_approach}
\caption{Approach overview.}
\label{fig:our_approach}
\end{figure}

As in Figure~\ref{fig:our_approach}, we first use GPS data and POI information to run a rule-based trip purpose identification approach (POI required). We then manually validate all trips labeled by the rule-based approach to make sure that they are reasonable, and we treat these human-validated results as ground truth. We use them as training data for the classification approach (no POI required). We also simulate CDR data from the GPS data, in both triangulation and single cell form, and test our approach on the simulated data.

\section{Data Source}

\begin{figure}[htbp]
\centering
\includegraphics{data}
\caption{One user's GPS data.}
\label{fig:data}
\end{figure}

In this work, we use GPS data from five users in the city of Dubuque, IA. Each user has GPS data for 7 consecutive days, collected in June 2011, with over 100 position points per day. Figure~\ref{fig:data} shows the distribution of one user's GPS data over the 7 days. The features we use are the GPS collection time and the GPS latitude and longitude. POI information, including schools and shopping locations, is used only in the rule-based approach.

\section{GPS Data Model Design and Results}

\subsection{GPS Data Rule-Based Approach}

\begin{figure}[htbp]
\centering
\includegraphics{move}
\caption{Move.}
\label{fig:move}
\end{figure}

\begin{figure}[htbp]
\centering
\includegraphics{stop}
\caption{Stop.}
\label{fig:stop}
\end{figure}

To train a classifier, we need training data that includes the properties and the class of each trip, with one row per trip. The first task is therefore to segment the GPS data into individual trips. We use the following rule to decide whether the user is moving or stopping: if the user travels more than 120 meters within 10 minutes, we define the user as moving, as in Figure~\ref{fig:move}; if the user travels less than 120 meters within 10 minutes, we define the user as stopping, as in Figure~\ref{fig:stop}.
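
The rule above can be written compactly. As a sketch, let $p(t)$ denote the user's position at time $t$ and $d(\cdot,\cdot)$ the distance in meters between two positions; these symbols, and the threshold names $T$ and $D$, are notation introduced here for illustration. Measuring the distance traveled as the displacement between the two ends of the window is an assumption, since the rule could also be read as path length. With $T = 10$ minutes and $D = 120$ meters,
\[
\mbox{state}(t) = \left\{
\begin{array}{ll}
\mbox{moving}   & \mbox{if } d\bigl(p(t),\, p(t+T)\bigr) > D, \\
\mbox{stopping} & \mbox{if } d\bigl(p(t),\, p(t+T)\bigr) < D.
\end{array}
\right.
\]
The rule does not say how to treat a distance of exactly $D$, so an implementation has to assign that boundary case to one of the two states.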

\begin{figure}[htbp]
\centering
\includegraphics{cdf}
\caption{CDF of GPS collection time intervals.}
\label{fig:cdf}
\end{figure}

The 10-minute limit comes from the cumulative distribution function (CDF) of the GPS collection time intervals, as in Figure~\ref{fig:cdf}. The 120-meter limit comes from the accuracy of the GPS devices. After trip segmentation, we have the start and end points of all trips. From this information, we can calculate the trip duration, trip start time, trip distance, day of the week, and so on.
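
As an illustration of how two of these features could be computed, suppose a trip starts at time $t_s$ at latitude and longitude $(\varphi_s, \lambda_s)$ and ends at time $t_e$ at $(\varphi_e, \lambda_e)$, with angles in radians. This notation is introduced here for the example, and using the great-circle distance between the two end points is an assumption, since the trip distance could also be accumulated along all intermediate GPS points. The trip duration is $\tau = t_e - t_s$, and the haversine formula gives the distance
\[
d = 2R \arcsin \sqrt{ \sin^2\frac{\varphi_e - \varphi_s}{2}
      + \cos\varphi_s \, \cos\varphi_e \, \sin^2\frac{\lambda_e - \lambda_s}{2} } ,
\]
where $R \approx 6371$ km is the mean radius of the Earth. The end-to-end distance is a lower bound on the length of the route actually driven.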

The trip segmentation results give us the features of all trips. We ultimately classify all trips into nine classes: home-based work, school, shopping, and other; non-home-based work, school, shopping, and other; and home destination (all trips whose destination is the home location). These trip classes are required by the Department of Transportation. The locations of home, work, school (given in POI), and shopping (given in POI) are therefore important.

\begin{itemize}

\item Home: For each user, we collect the location of the last GPS point of every day and take the median of these locations.
\item Work: For each user, we collect the GPS location points where the stay time is more than 2 hours and less than 8 hours, or that the user visited more than 3 times in one day. We then take the median of these locations.
\item School: Although school locations are already known from POI, some school trips might still be missed, for example pick-ups and drop-offs near a school. We therefore define a trip as a school trip if either (1) the destination is a school, the trip ends between 7:30 AM and 9:30 AM or between 3:00 PM and 6:00 PM on a workday, and the location is visited more than once in one week; or (2) there is a U-turn near a school location, and the location is visited more than once in one week.

\end{itemize}
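
The home and work rules can be sketched in formulas. The notation below is introduced for illustration, and taking the median separately in latitude and longitude is an assumption, because the rules do not say how the median of a set of locations is formed. Let $(\varphi_k, \lambda_k)$ be the last GPS point of day $k$ for $k = 1, \ldots, n$ (here $n = 7$). The home estimate is then
\[
\hat{h} = \bigl( \mathrm{median}\{\varphi_1, \ldots, \varphi_n\},\;
                 \mathrm{median}\{\lambda_1, \ldots, \lambda_n\} \bigr) .
\]
For work, let $s(p)$ be the stay time at point $p$ and $v(p)$ the number of visits to $p$ in one day. The candidate set is
\[
W = \{\, p \mid 2\,\mbox{h} < s(p) < 8\,\mbox{h} \,\} \;\cup\; \{\, p \mid v(p) > 3 \,\} ,
\]
and the work estimate is the median of the points in $W$, formed in the same way. A median is less sensitive than a mean to a single unusual day, such as a night spent away from home.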

\subsection{GPS Data Classification Approach}

With the home, work, school, and other POI information, together with the trip segmentation results, we can now classify all trips into the nine classes: home-based work, school, shopping, and other; non-home-based work, school, shopping, and other; and home destination.

For home and work location detection, we use the following features: stay time, start time, day of the week, and number of visits. For trip purpose detection, we use: trip duration, start time, day of the week, number of visits to the destination, stay time at the destination, trip order, distance from the start location to home, distance from the start location to work, distance from the end location to home, and distance from the end location to work. The algorithms are as follows:

\begin{itemize}
\item Neural Network (MATLAB Neural Network Toolbox, 10 random runs, 20 layers, 80\% training, 10\% validation, 10\% testing)
\item SVM (LIBSVM, NTU, 10-fold cross-validation)
\item KNN (MATLAB Bioinformatics Toolbox, 10-fold cross-validation, N=3)
\item Decision Tree (MATLAB Statistics Toolbox, 10-fold cross-validation, CART)
\end{itemize}

Trips are not independent of each other: if one trip goes from home to school, then the next trip cannot be home-based, because the user is now at school. We therefore add an enhancement to our model. Every prediction comes with a confidence score. If this score is less than 0.8 and also less than the confidence scores of both the previous trip and the next trip, we add the previous trip class and the next trip class to the training data and predict the class of this trip again. The effect of this enhancement is very limited, because only two values are updated: the previous trip class and the next trip class.
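
The trigger for this second prediction can be stated as a single condition. As a sketch, let $c_i$ be the confidence score of the prediction for trip $i$, $x_i$ its feature vector, and $y_i$ its predicted class; these symbols are introduced here for illustration. Trip $i$ is predicted again when
\[
c_i < \min\bigl( 0.8,\; c_{i-1},\; c_{i+1} \bigr) ,
\]
and the second prediction uses the extended feature vector
\[
x_i' = ( x_i,\; y_{i-1},\; y_{i+1} ) .
\]
The first and last trips of a sequence have only one neighbor; how they are handled is not specified here and would have to be decided in an implementation.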

\subsection{GPS Data Classification Results}

\begin{figure}[htbp]
\centering
\includegraphics{gps1}
\caption{GPS trip classification error rate.}
\label{fig:gps1}
\end{figure}

Figure~\ref{fig:gps1} shows that Decision Tree and Neural Network give the best results (error rate below 10\%) for home and work location detection. However, only Decision Tree achieves an error rate below 10\% for trip classification.

\begin{figure}[htbp]
\centering
\includegraphics{gps2}
\caption{GPS enhanced trip classification error rate.}
\label{fig:gps2}
\end{figure}

Figure~\ref{fig:gps2} shows that the enhancement does not noticeably improve trip class detection. Besides the fact that we only update two values (the previous trip class and the next trip class), the trip class training data is not reliable. Moreover, with 10-fold cross-validation, the test data in each fold is only 10\% of the total data. We will test on more data in the future.

\section{CDR Data Simulation and Results}

CDR (Call Detail Records) data is collected by cell phone carriers. CDR data has a low frequency (around 50 records per day) compared to GPS data (over 100 points per day). There are two types of CDR data: (1) triangulation data, in which the user's location is determined using three cell phone towers or stations (with an error of 50 to 200 meters); and (2) single cell data, in which the user's location is represented by the nearest cell phone tower.

\subsection{CDR Triangulation}

\begin{figure}[htbp]
\centering
\includegraphics{cdrTri}
\caption{CDR triangulation data simulation.}
\label{fig:cdrTri}
\end{figure}

We randomly select 50 points per day from the GPS data and then add a random error between 0 and 0.001 to the latitude and longitude of each point, as in Figure~\ref{fig:cdrTri}. On this simulated data, we use the same rules as for GPS data to detect home and work locations. When determining the trip class, we use the last-overlap method. For example, in Figure~\ref{fig:cdrTri1}, the point ``other'' is randomly deleted during the simulation. In CDR triangulation, we then classify the trip from ``home'' to ``work'' as non-home-based work.
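
To give a rough sense of the size of this simulated error, the perturbation can be written as
\[
\varphi' = \varphi + \varepsilon_{\varphi} , \qquad
\lambda' = \lambda + \varepsilon_{\lambda} , \qquad
\varepsilon_{\varphi},\, \varepsilon_{\lambda} \in [0,\, 0.001] ,
\]
where $\varphi$ is the latitude and $\lambda$ the longitude. The symbols are introduced here for illustration, and two assumptions are made: the error is expressed in degrees, and it is drawn uniformly from the interval. One degree of latitude is about 111 km, so $0.001^{\circ}$ of latitude is about 111 m. At the latitude of Dubuque (roughly $42.5^{\circ}$ N), one degree of longitude is about $111 \times \cos 42.5^{\circ} \approx 82$ km, so $0.001^{\circ}$ of longitude is about 82 m. Under these assumptions the largest possible displacement is
\[
\sqrt{111^2 + 82^2} \approx 138 \ \mbox{m} .
\]
Because both errors are non-negative, every simulated point is shifted toward the north-east of the original GPS point.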

\begin{figure}[htbp]
\centering
\includegraphics{cdrTri1}
\caption{CDR triangulation trip class definition.}
\label{fig:cdrTri1}
\end{figure}

The CDR triangulation data is now ready for training. We use the same machine learning classification approach as for the GPS data. Figure~\ref{fig:cdrTri2} shows the CDR triangulation trip classification error rate. Decision Tree has the best results for home and work location detection, with an error rate of around 15\%. Decision Tree also has the best results for trip classification, with an error rate of around 40\%. The error from the randomly assigned classes is high. Figure~\ref{fig:cdrTri3} shows the error rate of the enhanced CDR triangulation trip classification. The enhanced algorithm produces almost the same results as the non-enhanced algorithm; for Decision Tree, it gives a slight improvement.


\begin{figure}[htbp]
\centering
\includegraphics{cdrTri2}
\caption{CDR triangulation trip classification error rate.}
\label{fig:cdrTri2}
\end{figure}

\begin{figure}[htbp]
\centering
\includegraphics{cdrTri3}
\caption{CDR triangulation enhanced trip classification error rate.}
\label{fig:cdrTri3}
\end{figure}


\subsection{CDR Single Cell}

\begin{figure}[htbp]
\centering
\includegraphics{cdrCell}
\caption{CDR single cell data simulation.}
\label{fig:cdrCell}
\end{figure}

We manually selected 12 cell phone stations for the city of Dubuque, IA. We then randomly select 50 points per day from the GPS data. CDR single cell uses the nearest cell phone station to represent the location of each of the 50 points, as in Figure~\ref{fig:cdrCell}. On this simulated data, we use the same rules as for GPS data to detect home and work locations. When determining the trip class, we use the last-overlap method. In Figure~\ref{fig:cdrCell1}, even though the trip actually goes from the ``home'' cell to the ``other'' cell, we treat it as a trip from ``other'' to ``other'' and classify it as non-home-based work. This label comes from the GPS data, which we treat as ground truth. Because some cells are large enough to cover many places, such as schools and shopping areas, we classify all trips into only five classes: home-based work and other, non-home-based work and other, and home destination.
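
The single cell simulation amounts to a nearest-neighbor assignment. As a sketch, let $s_1, \ldots, s_{12}$ be the locations of the 12 selected stations and $d(\cdot,\cdot)$ a distance between two locations; the symbols are introduced here for illustration, and the choice of distance (for example great-circle distance) is an assumption. A sampled GPS point $p$ is assigned to the cell
\[
c(p) = \arg\min_{j \in \{1, \ldots, 12\}} d(p,\, s_j) ,
\]
and its simulated CDR location is $\tilde{p} = s_{c(p)}$. All points that fall in the same cell therefore become indistinguishable, which is why places such as schools and shops cannot be separated within a large cell.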

\begin{figure}[htbp]
\centering
\includegraphics{cdrCell1}
\caption{CDR single cell trip class definition.}
\label{fig:cdrCell1}
\end{figure}

The CDR single cell data is now ready for training. We use the same machine learning classification approach as for the GPS data. Figure~\ref{fig:cdrCell2} shows the CDR single cell trip classification error rate. Decision Tree and Neural Network have the best results for home and work location detection. Decision Tree also has the best results for trip classification. The trip class error for CDR single cell is lower than that for CDR triangulation, because CDR single cell has only 5 trip classes, whereas CDR triangulation has 9. The error from the randomly assigned classes is high. Figure~\ref{fig:cdrCell3} shows the error rate of the enhanced CDR single cell trip classification. The enhanced algorithm produces almost the same results as the non-enhanced algorithm.
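
A simple baseline illustrates why the number of classes matters when comparing the two error rates. As a sketch, assume a classifier that guesses uniformly at random among $K$ classes; this baseline is introduced here for illustration and is not one of the evaluated methods. Its expected error rate is
\[
e_{\mathrm{chance}}(K) = 1 - \frac{1}{K} ,
\]
which gives $e_{\mathrm{chance}}(9) = 8/9 \approx 0.89$ for the nine classes of CDR triangulation and $e_{\mathrm{chance}}(5) = 4/5 = 0.80$ for the five classes of CDR single cell. The five-class problem is easier before any learning takes place, so the two error rates are not directly comparable.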

\begin{figure}[htbp]
\centering
\includegraphics{cdrCell2}
\caption{CDR single cell trip classification error rate.}
\label{fig:cdrCell2}
\end{figure}

\begin{figure}[htbp]
\centering
\includegraphics{cdrCell3}
\caption{CDR single cell enhanced trip classification error rate.}
\label{fig:cdrCell3}
\end{figure}

\section{Conclusion and Future Directions}

In this work, we designed a rule-based approach to detect home and work locations and trip purpose. Using the results of the rule-based approach as training data, we experimented with Neural Network, SVM, KNN, and Decision Tree classifiers, which eliminate the use of POI. We also simulated CDR triangulation data and CDR single cell data and tested our model on them.

In the future, we will continue to optimize the algorithm to reduce the CDR trip purpose classification error rate. We would also like to identify the travel mode: does a person walk, bike, drive a private car, or take a public bus? Finally, we would like to use temporal algorithms such as the Hidden Markov Model.

\end{document}






















